πŸŽ™οΈ COMPLETE ROADMAP: Building Text-to-Speech (TTS) & Speech-to-Text (STT) Models & Services

From Scratch to Production β€” Beginner β†’ Advanced β†’ Research Level

Roadmap Version: 2025 | Last Updated: March 2025 | Author Note: Covers 2–3 years of full-stack speech AI development.

1. FOUNDATION PREREQUISITES

1.1 Mathematics

  • Linear Algebra: Vectors, matrices, dot products, SVD, eigenvalues
    • Why: All neural networks are matrix operations
  • Calculus: Derivatives, gradients, chain rule, partial derivatives
    • Why: Backpropagation relies on chain rule
  • Probability & Statistics: Gaussian distributions, Bayesian inference, MLE, MAP
    • Why: Acoustic models are probabilistic; language models use probability
  • Signal Processing Mathematics
    • Fourier Transform (DFT, FFT)
    • Convolution theorem
    • Z-Transform
    • Nyquist-Shannon sampling theorem
    • Windowing functions (Hamming, Hann, Blackman)

1.2 Programming Languages

  • Python (primary; the dominant language for ML/audio research code)
    • NumPy, SciPy, Matplotlib
    • OOP, functional patterns, async programming
  • C++ (for low-latency inference engines)
  • JavaScript/TypeScript (for web APIs and browser-based STT/TTS)
  • Shell/Bash (for pipeline automation, data processing)

1.3 Deep Learning Foundations

  • Forward & backward propagation
  • Activation functions (ReLU, GELU, Sigmoid, Softmax)
  • Optimizers (SGD, Adam, AdamW, Lion)
  • Regularization (Dropout, BatchNorm, LayerNorm, Weight Decay)
  • Loss functions (Cross-entropy, CTC, MSE, L1)
  • Sequence modeling fundamentals

1.4 Audio/Signal Processing Basics

  • What is sound? Pressure waves, frequency, amplitude
  • Sample rate (8kHz, 16kHz, 22.05kHz, 44.1kHz, 48kHz)
  • Bit depth (8-bit, 16-bit, 32-bit float)
  • Mono vs stereo
  • Audio file formats: WAV, MP3, FLAC, OGG, OPUS
  • Waveform representation
  • Time domain vs frequency domain
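To make these terms concrete, a short sketch that inspects a file's sample rate, bit depth, channel count and peak amplitude (the file path is a placeholder):

import numpy as np
import soundfile as sf

data, sr = sf.read("example.wav")                  # data: float array in [-1, 1], sr: sample rate in Hz
info = sf.info("example.wav")                      # container/bit-depth metadata

print(f"Sample rate: {sr} Hz")                     # e.g. 16000 or 44100
print(f"Subtype:     {info.subtype}")              # e.g. PCM_16 -> 16-bit
print(f"Channels:    {1 if data.ndim == 1 else data.shape[1]}")  # mono vs stereo
print(f"Duration:    {len(data) / sr:.2f} s")
print(f"Peak amp:    {np.max(np.abs(data)):.3f}")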

2. CORE CONCEPTS & WORKING PRINCIPLES

2.1 How Human Speech Works

Lungs → Air pressure → Vocal cords vibrate → Resonates in vocal tract → Articulators shape sound (tongue, lips, teeth) → Acoustic wave → Air → Ear
  • Phonemes: Smallest units of sound (~44 in English)
  • Prosody: Rhythm, stress, intonation, tempo
  • Coarticulation: Phonemes influence neighboring sounds
  • Formants: Resonant frequencies of the vocal tract (F1, F2, F3...)

2.2 Speech-to-Text (STT) β€” Working Principle

Audio Input → Pre-processing → Feature Extraction → Acoustic Model → Language Model → Decoder → Text Output

Step-by-step:

  1. Microphone captures pressure variations → digital signal (waveform)
  2. Pre-process: remove noise, normalize, apply VAD (Voice Activity Detection)
  3. Extract features: convert raw audio to MFCCs, Mel Spectrograms, or raw waveform
  4. Acoustic model: predict phoneme/subword probabilities at each timestep
  5. Language model: rescore sequences based on linguistic probability
  6. Decoder: find most likely word sequence (Viterbi, Beam Search, CTC Greedy)
  7. Post-processing: punctuation restoration, capitalization, speaker labeling
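Before implementing each stage yourself, it helps to run the whole pipeline once with an off-the-shelf model. A minimal sketch using the openai-whisper package (model size and audio path are illustrative):

import whisper

model = whisper.load_model("base")            # downloads weights on first run
result = model.transcribe("speech.wav")       # placeholder audio path
print(result["text"])                         # recognized text

# Segment-level timestamps come back as well
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")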

2.3 Text-to-Speech (TTS) β€” Working Principle

Text Input → Text Analysis → Linguistic Features → Acoustic Model → Vocoder → Audio Waveform Output

Step-by-step:

  1. Input text normalization (numbers → words, abbreviations → full form)
  2. G2P (Grapheme-to-Phoneme): convert letters to phonemes
  3. Prosody prediction: duration, pitch, energy per phoneme
  4. Acoustic model: generate mel spectrogram from linguistic features
  5. Vocoder: convert mel spectrogram to raw audio waveform
  6. Post-processing: audio normalization, format encoding
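The same end-to-end view for TTS, sketched with Coqui TTS; the model name is one of its pretrained LJ Speech voices and the output path is arbitrary:

from TTS.api import TTS

# Text analysis, acoustic model and vocoder all run inside this one call
tts = TTS("tts_models/en/ljspeech/vits")      # downloads on first use
tts.tts_to_file(text="Hello from a neural text to speech system.",
                file_path="hello.wav")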

2.4 Key Audio Representations

Representation | Description | Used In
Raw Waveform | Time-domain amplitude samples | WaveNet, WaveGlow, EnCodec
STFT Spectrogram | Frequency vs time (complex) | Analysis, source separation
Mel Spectrogram | Perceptually-scaled frequency | Tacotron, Whisper, FastSpeech
MFCC | Compressed mel cepstral coefficients | Traditional ASR, GMM-HMM
Log-Mel | Log of mel spectrogram | Whisper, wav2vec 2.0
Codec Tokens | Discrete audio tokens | EnCodec, SoundStream, VALL-E
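A short librosa sketch that computes several of these representations from one file, handy for comparing their shapes (file path and frame settings are illustrative):

import librosa

y, sr = librosa.load("example.wav", sr=16000)                        # raw waveform, shape (T,)

stft = librosa.stft(y, n_fft=400, hop_length=160)                    # complex STFT, (201, frames)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)      # (80, frames)
log_mel = librosa.power_to_db(mel)                                   # log-mel, Whisper-style input
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                   # (13, frames), classical ASR

for name, feat in [("waveform", y), ("STFT", stft), ("log-mel", log_mel), ("MFCC", mfcc)]:
    print(name, feat.shape)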

3. STRUCTURED LEARNING PATH

PHASE 0: Signal Processing Foundations (4-6 weeks)

  • Topic 1: Digital Audio Fundamentals
    • Sampling and quantization
    • Aliasing and anti-aliasing filters
    • PCM encoding
    • Practice: Load WAV files, plot waveforms with librosa/scipy
  • Topic 2: Fourier Analysis
    • Discrete Fourier Transform (DFT)
    • Fast Fourier Transform (FFT), via the Cooley-Tukey algorithm
    • Short-Time Fourier Transform (STFT)
      • Window size (frame length), hop size, overlap
      • Griffin-Lim reconstruction algorithm
    • Practice: Compute STFT, plot spectrograms, reconstruct audio
  • Topic 3: Mel Scale & Perceptual Features
    • Mel filter banks (triangular filters on Mel scale)
    • MFCC computation pipeline:
      1. Pre-emphasis filter
      2. Framing + windowing
      3. FFT
      4. Mel filter bank
      5. Log compression
      6. DCT (Discrete Cosine Transform)
    • Delta and delta-delta features
    • Practice: Implement MFCC from scratch without librosa (see the NumPy sketch at the end of this phase)
  • Topic 4: Audio Pre-processing
    • Noise reduction (spectral subtraction, Wiener filter)
    • Voice Activity Detection (VAD): energy-based, WebRTC VAD, Silero VAD
    • Audio normalization (peak, RMS, LUFS)
    • Resampling (polyphase filters)
    • Practice: Build an audio pre-processing pipeline
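For Topic 3's practice item (referenced above), a minimal NumPy sketch of the six-step MFCC pipeline; the 25 ms window / 10 ms hop and 26 mel filters are common 16 kHz defaults, not the only valid choice:

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=26, n_mfcc=13):
    # 1. Pre-emphasis: boost high frequencies
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 3. Power spectrum via FFT (frames are zero-padded to n_fft)
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft        # (n_frames, n_fft//2 + 1)
    # 4. Mel filter bank: triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log compression
    log_mel = np.log(power @ fbank.T + 1e-10)                        # (n_frames, n_mels)
    # 6. DCT keeps the lowest cepstral coefficients
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]    # (n_frames, n_mfcc)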

PHASE 1: Classical Speech Processing (3-4 weeks)

  • Topic 5: Hidden Markov Models (HMM)
    • Markov chains and state transitions
    • HMM components: states, observations, transition matrix, emission matrix
    • Three HMM problems:
      1. Evaluation - Forward algorithm
      2. Decoding - Viterbi algorithm (a NumPy sketch appears at the end of this phase)
      3. Learning - Baum-Welch (EM algorithm)
    • HMM for phoneme modeling
    • Practice: Implement HMM for digit recognition
  • Topic 6: Gaussian Mixture Models (GMM)
    • Mixture of Gaussians
    • EM algorithm for GMM training
    • GMM-HMM acoustic models
    • Speaker adaptation: MLLR, MAP adaptation
  • Topic 7: N-gram Language Models
    • Unigram, bigram, trigram
    • Perplexity metric
    • Smoothing: Laplace, Kneser-Ney, Good-Turing
    • ARPA format language model files
  • Topic 8: Classical Vocoders (TTS)
    • Formant synthesis (rule-based)
    • Concatenative TTS: unit selection
    • STRAIGHT vocoder
    • WORLD vocoder (F0 + spectral envelope + aperiodicity)
    • Practice: Use WORLD vocoder to analyze and resynthesize speech
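For Topic 5's decoding problem (referenced above), a minimal NumPy Viterbi sketch for a discrete-observation HMM, done in log space to avoid underflow:

import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete-observation HMM.
    obs: observation indices (T,); pi: initial probs (N,);
    A: transition matrix (N, N); B: emission matrix (N, n_symbols)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # best log-probability of any path ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):        # backtrack from the best final state
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())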

PHASE 2: Deep Learning for Speech (6-8 weeks)

  • Topic 9: Recurrent Neural Networks
    • Vanilla RNN and vanishing gradient problem
    • LSTM (Long Short-Term Memory):
      • Input gate, forget gate, output gate, cell state
    • GRU (Gated Recurrent Unit)
    • Bidirectional RNNs
    • Practice: Build sequence-to-sequence model for toy TTS
  • Topic 10: Convolutional Neural Networks for Audio
    • 1D convolution for raw waveform
    • 2D convolution for spectrograms
    • Dilated causal convolutions (key for WaveNet)
    • Depthwise separable convolutions
    • Practice: Build CNN-based phoneme classifier
  • Topic 11: Attention Mechanisms
    • Dot-product attention
    • Scaled dot-product attention
    • Multi-head attention
    • Self-attention vs cross-attention
    • Location-sensitive attention (Tacotron)
    • Practice: Implement attention from scratch (see the PyTorch sketch at the end of this phase)
  • Topic 12: Transformer Architecture
    • Encoder-Decoder structure
    • Positional encoding (sinusoidal, learned, RoPE, ALiBi)
    • Feed-forward networks
    • Layer normalization (Pre-LN vs Post-LN)
    • Masked attention for autoregressive decoding
    • Practice: Train a small Transformer on character sequences
  • Topic 13: Connectionist Temporal Classification (CTC)
    • The alignment problem in speech recognition
    • CTC forward algorithm
    • CTC loss and gradient
    • CTC greedy and beam search decoding
    • CTC + language model rescoring
    • Practice: Train CTC model on TIMIT dataset
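For Topic 11's practice item (referenced above), a minimal PyTorch sketch of scaled dot-product and multi-head attention; dimensions are illustrative:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, time, d_head)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:                     # e.g. causal mask for autoregressive decoding
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q/K/V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, time, d_head)
        q, k, v = [y.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for y in (q, k, v)]
        o = scaled_dot_product_attention(q, k, v, mask)
        return self.out(o.transpose(1, 2).reshape(b, t, -1))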

PHASE 3: Modern STT Systems (8-10 weeks)

  • Topic 14: End-to-End ASR Architectures
    • Listen, Attend and Spell (LAS)
    • Deep Speech 1 & 2 (Baidu)
    • Jasper, QuartzNet (NVIDIA)
    • Conformer (combining CNN + Transformer)
    • Architecture comparison: CTC vs Attention vs RNN-T
  • Topic 15: Self-Supervised Learning for Speech
    • Contrastive Predictive Coding (CPC)
    • wav2vec / wav2vec 2.0 (Facebook/Meta)
      • CNN feature encoder + Transformer context network
      • Quantization module (product quantization)
      • Contrastive loss with negative sampling
    • HuBERT (Hidden Unit BERT)
      • Offline clustering → pseudo-label generation
      • BERT-style masked prediction
    • WavLM: wav2vec 2.0 + denoising objective
    • Practice: Fine-tune wav2vec 2.0 on custom dataset (an inference starting point is sketched at the end of this phase)
  • Topic 16: Whisper (OpenAI)
    • Architecture: Encoder-Decoder Transformer
    • Training data: 680,000 hours weakly supervised
    • Input: 30-second log-Mel spectrogram (80 channels)
    • Multitask training: transcription + translation + language ID + VAD
    • Tokenizer: BPE with multilingual vocabulary
    • Model sizes: tiny(39M), base(74M), small(244M), medium(769M), large(1.5B)
    • Practice: Deploy Whisper, fine-tune on domain-specific data
  • Topic 17: RNN-T (Recurrent Neural Network Transducer)
    • Encoder (audio) + Prediction network (text) + Joint network
    • Transducer loss function
    • On-device streaming ASR
    • Used by: Google, Apple, Amazon Alexa
    • Practice: Train small RNN-T on LibriSpeech subset
  • Topic 18: Streaming & Real-Time ASR
    • Chunk-based processing
    • Latency vs accuracy tradeoff
    • Lookahead context
    • Cache-aware streaming Conformer
    • CTC prefix beam search for streaming
    • Practice: Build real-time transcription with WebRTC + Whisper
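For Topic 15's practice item (referenced above), a reasonable starting point is greedy-decoding inference with a pretrained wav2vec 2.0 CTC checkpoint via HuggingFace Transformers before any fine-tuning; paths and model name are illustrative:

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("speech.wav")               # placeholder path
if sr != 16000:                                            # wav2vec 2.0 expects 16 kHz mono
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # (1, frames, vocab)

ids = torch.argmax(logits, dim=-1)                         # greedy CTC decoding
print(processor.batch_decode(ids)[0])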

PHASE 4: Modern TTS Systems (8-10 weeks)

  • Topic 19: Neural TTS Pipeline
    • Text normalization (written → spoken form)
      • Number normalization
      • Abbreviation expansion
      • Date/time normalization
    • G2P (Grapheme-to-Phoneme):
      • Rule-based (CMU Pronouncing Dictionary)
      • Sequence-to-sequence G2P
      • Transformer G2P
    • Phoneme inventory and IPA
    • Prosody: F0 (pitch), duration, energy
  • Topic 20: Tacotron & Tacotron 2
    • Tacotron 1: CBHG + attention + Griffin-Lim
    • Tacotron 2:
      • Encoder: Conv layers + BiLSTM
      • Attention: Location-sensitive
      • Decoder: Autoregressive LSTM → mel spectrogram
      • Stop token prediction
      • WaveNet vocoder
    • Practice: Train Tacotron 2 on LJ Speech dataset
  • Topic 21: FastSpeech & FastSpeech 2
    • FastSpeech 1: Knowledge distillation from autoregressive teacher
      • Feed-forward Transformer (FFT)
      • Length regulator (expands phonemes by duration; sketched at the end of this phase)
      • Parallel mel generation (non-autoregressive)
    • FastSpeech 2: No teacher model; trains directly on ground-truth duration, pitch and energy
      • Duration predictor
      • Pitch predictor (F0)
      • Energy predictor
      • Variance adaptor
    • Speed: mel-spectrogram generation roughly 270x faster than autoregressive Tacotron-style models
    • Practice: Train FastSpeech 2 on LJ Speech
  • Topic 22: VITS (Variational Inference TTS)
    • End-to-end: text → waveform in one model
    • Components: posterior encoder, prior encoder, decoder (HiFi-GAN)
    • Variational autoencoder (VAE) latent space
    • Normalizing flows (affine coupling layers)
    • GAN training for waveform quality
    • Stochastic duration predictor
    • Practice: Train VITS, experiment with fine-tuning on custom voice
  • Topic 23: Neural Vocoders
    • WaveNet: Autoregressive dilated causal CNN, slow but high quality
    • WaveGlow: Normalizing flow, parallel generation
    • MelGAN: GAN-based, fast, lightweight
    • HiFi-GAN: Multi-period discriminator + multi-scale discriminator, best quality/speed
    • BigVGAN: Large-scale HiFi-GAN with anti-aliased activations
    • EnCodec: Neural audio codec (RVQ-based), used as tokenizer
    • Practice: Train HiFi-GAN on LJ Speech
  • Topic 24: Voice Cloning
    • Speaker embeddings: d-vector, x-vector, ECAPA-TDNN
    • Speaker verification vs identification
    • Zero-shot voice cloning: YourTTS, XTTS, OpenVoice
    • Few-shot voice cloning: 3-10 seconds of reference audio
    • Speaker encoder: GE2E loss (generalized end-to-end loss)
    • Practice: Implement zero-shot voice cloning with XTTS

PHASE 5: Large-Scale Models & Advanced Techniques (8-12 weeks)

  • Topic 25: Language Models for TTS & STT
    • VALL-E: TTS as a language modeling task
      • EnCodec tokens (8 RVQ levels)
      • AR model for coarse tokens + NAR for fine tokens
      • In-context learning for voice cloning
    • AudioLM: Audio continuation using hierarchical tokens
    • SoundStorm: Non-autoregressive audio generation
    • Voicebox: Flow-matching-based TTS
  • Topic 26: Diffusion Models for Speech
    • Score-based generative models
    • Denoising Diffusion Probabilistic Models (DDPM)
    • DiffWave: diffusion-based vocoder
    • Grad-TTS: diffusion-based acoustic model
    • Stable Diffusion concepts applied to audio
    • DDIM sampling for fast inference
  • Topic 27: Flow Matching
    • Continuous normalizing flows
    • Flow matching vs diffusion: faster training, ODE-based sampling (a minimal training sketch appears at the end of this phase)
    • Voicebox (Meta): flow matching for TTS
    • Matcha-TTS: ODE-based TTS
    • E2-TTS / F5-TTS: flow matching with flat text input
  • Topic 28: Multilingual & Code-Switching
    • Multilingual acoustic models
    • Language identification integration
    • Code-switching (mixing languages mid-sentence)
    • MMS (Meta Massively Multilingual Speech): 1000+ languages
    • Cross-lingual transfer learning
    • Low-resource language adaptation
  • Topic 29: Emotion & Style Control
    • Emotion embeddings (happy, sad, angry, neutral...)
    • Global Style Tokens (GST)
    • Reference audio-based style transfer
    • Prosody transfer
    • Voice conversion (change voice, keep content)
    • Practice: Build emotion-controlled TTS using GST-Tacotron
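For Topic 27 (referenced above), the training objective is easy to sketch. Below is a simplified flow-matching training step using a straight noise-to-data path (rectified-flow style); `model` stands for any conditional network predicting a velocity field and is an assumption, not a specific paper's architecture:

import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, cond, optimizer):
    """One training step. x1: target mel (batch, 80, T); cond: conditioning (e.g. text encoding)."""
    x0 = torch.randn_like(x1)                         # noise sample
    t = torch.rand(x1.size(0), device=x1.device)      # random time in [0, 1]
    t_ = t.view(-1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                      # point on the straight path from noise to data
    target_v = x1 - x0                                # velocity of that path
    loss = F.mse_loss(model(xt, t, cond), target_v)   # regress the velocity field
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At inference, audio (or a mel spectrogram) is produced by integrating the learned velocity field from noise at t=0 to data at t=1 with an ODE solver, typically in far fewer steps than a diffusion sampler.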

PHASE 6: Production & MLOps (4-6 weeks)

  • Topic 30: Model Optimization
    • Quantization: INT8, INT4, dynamic quantization (sketched at the end of this phase)
    • Pruning: structured, unstructured, magnitude-based
    • Knowledge distillation for smaller models
    • ONNX export and ONNX Runtime
    • TensorRT optimization (NVIDIA)
    • OpenVINO (Intel)
    • Edge deployment: TFLite, CoreML, NCNN
  • Topic 31: Inference Optimization
    • Batching strategies (dynamic batching)
    • Caching (KV cache, encoder cache)
    • Speculative decoding
    • CTranslate2 for faster Transformer inference
    • Triton Inference Server
    • TorchScript and torch.compile
  • Topic 32: Service Architecture
    • REST API design (FastAPI, Flask)
    • WebSocket for real-time streaming
    • gRPC for high-performance RPC
    • Message queues (RabbitMQ, Kafka) for async processing
    • Load balancing and horizontal scaling
    • Rate limiting and API key management
    • CDN for audio delivery
  • Topic 33: MLOps Pipeline
    • Experiment tracking: MLflow, Weights & Biases
    • Data versioning: DVC
    • Model registry and versioning
    • CI/CD for ML models
    • Monitoring: model drift, latency, error rate
    • A/B testing for TTS quality
    • Data flywheel and continuous improvement
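For Topic 30 (referenced above), two of the most common first steps are dynamic INT8 quantization and ONNX export. A minimal sketch; `MyASRModel` and the input shape are hypothetical placeholders:

import torch
import torch.nn as nn

model = MyASRModel().eval()          # hypothetical trained model

# Dynamic INT8 quantization of Linear layers (CPU inference)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# ONNX export of the FP32 model for ONNX Runtime / TensorRT / OpenVINO
dummy = torch.randn(1, 80, 3000)     # example log-mel input shape
torch.onnx.export(model, dummy, "asr.onnx",
                  input_names=["mel"], output_names=["logits"],
                  dynamic_axes={"mel": {0: "batch", 2: "time"}})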

4. ALGORITHMS, TECHNIQUES & TOOLS

4.1 Core Algorithms

STT Algorithms

Algorithm | Type | Key Use
Viterbi | Dynamic programming | HMM decoding, best path
Baum-Welch | EM | HMM training
CTC Forward-Backward | Dynamic programming | CTC loss computation
Beam Search | Tree search | Sequence decoding
Prefix Beam Search | Tree search | CTC with LM integration
WFST (Weighted FST) | Graph | Kaldi-style decoding
BPE (Byte Pair Encoding) | Tokenization | Subword vocabulary
Word2Vec/FastText | Embedding | Text representation
Forced Alignment | Dynamic programming | Aligning audio to transcripts

TTS Algorithms

Algorithm | Type | Key Use
Griffin-Lim | Phase reconstruction | Spectrogram → waveform
WORLD vocoder | Signal processing | Parametric voice synthesis
VAE | Generative | Latent space for style
Normalizing Flows | Generative | Invertible transformations
GAN | Generative | Waveform generation, vocoders
DDPM | Generative | Diffusion vocoders
Flow Matching | Generative | Fast TTS (F5-TTS, Voicebox)
RVQ (Residual Vector Quantization) | Compression | Audio tokenization

4.2 Neural Network Architectures

  • CNN: WaveNet, DeepSpeech, Jasper, QuartzNet
  • LSTM/GRU: Tacotron, early E2E ASR
  • Transformer: Whisper, FastSpeech, wav2vec 2.0
  • Conformer: SOTA for ASR (CNN + Self-attention hybrid)
  • Diffusion U-Net: DiffWave, Grad-TTS
  • Flow network: WaveGlow, Glow-TTS, VITS
  • Codec model: EnCodec, SoundStream, DAC

4.3 Training Techniques

  • Teacher Forcing: train decoder with ground truth
  • Scheduled Sampling: gradually mix teacher/model predictions
  • Knowledge Distillation: teacher-student training
  • Contrastive Learning: wav2vec, SimCLR-style
  • Multi-task Learning: Whisper (transcription + translation + LID)
  • Transfer Learning: fine-tune pretrained models
  • Data Augmentation:
    • SpecAugment (time/frequency masking)
    • Speed perturbation (0.9x, 1.0x, 1.1x)
    • Room Impulse Response (RIR) convolution
    • Additive noise (MUSAN, AudioSet)
    • Pitch shifting, time stretching
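A minimal SpecAugment sketch with torchaudio's built-in masking transforms, applied to log-mel batches on the fly during training (mask sizes are illustrative, not tuned):

import torch
import torchaudio.transforms as T

spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=27),   # mask up to 27 mel bins
    T.TimeMasking(time_mask_param=100),       # mask up to 100 frames
    T.TimeMasking(time_mask_param=100),       # SpecAugment typically applies several time masks
)

log_mel = torch.randn(1, 80, 1000)            # stand-in for a real log-mel batch
augmented = spec_augment(log_mel)             # applied on the fly during training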

4.4 Python Libraries & Frameworks

Audio Processing
  • librosa - Audio analysis, feature extraction, visualization
  • soundfile - Read/write audio files (WAV, FLAC, OGG)
  • pydub - Audio manipulation (cut, join, convert)
  • scipy.signal - Signal processing primitives
  • torchaudio - PyTorch audio I/O and transforms
  • audioread - Backend-agnostic audio reading
  • pyworld - Python wrapper for the WORLD vocoder
  • resampy - High-quality audio resampling
  • webrtcvad - Google's WebRTC VAD
  • silero-vad - Neural VAD (accurate, fast)
Deep Learning
  • PyTorch - Primary framework for research/production
  • TensorFlow - Production, mobile (TFLite)
  • JAX/Flax - Google research framework
  • HuggingFace Transformers - Pre-trained models hub
  • HuggingFace Datasets - Dataset loading/processing
STT-Specific
  • openai-whisper - OpenAI Whisper (all sizes)
  • faster-whisper - CTranslate2-optimized Whisper (~4x faster)
  • whisperx - Whisper + word-level alignment
  • nemo (NVIDIA NeMo) - ASR, TTS, NLP toolkit
  • espnet - End-to-end speech processing
  • kaldi - Classical + hybrid ASR
  • speechbrain - PyTorch speech toolkit
  • wav2letter++ - Meta's ASR toolkit
  • deepgram - Commercial STT API (also research)
  • vosk - Offline STT (lightweight)
TTS-Specific
  • TTS (Coqui) - Open-source TTS: Tacotron, VITS, XTTS
  • espeak-ng - Lightweight rule-based TTS (G2P)
  • pyttsx3 - Offline TTS wrapper
  • bark - Suno's generative TTS (GPT-style)
  • tortoise-tts - Slow but high-quality multi-voice TTS
  • XTTS / Coqui XTTS - Multilingual voice cloning (VITS-based)
  • StyleTTS2 - Style-based TTS (SOTA on LJ Speech)
  • parler-tts - Description-controlled TTS
  • kokoro-tts - Lightweight high-quality TTS
Serving & Infrastructure
  • FastAPI - Async Python web framework
  • uvicorn - ASGI server
  • triton - NVIDIA model serving
  • onnxruntime - Cross-platform model inference
  • ctranslate2 - Efficient Transformer inference
  • ray serve - Distributed model serving
  • celery - Async task queue
  • redis - Caching, pub/sub, queue

5. ARCHITECTURE DEEP DIVE

5.1 STT Architecture Family Tree

Classical Era
├── GMM-HMM (1990s-2010s)
│   ├── Feature: MFCC
│   ├── Acoustic: GMM per HMM state
│   └── Decoder: Viterbi + N-gram LM
└── DNN-HMM (2012-2016)
    ├── Feature: MFCC / fbank
    ├── Acoustic: DNN replaces GMM
    └── Decoder: Viterbi + WFST

End-to-End Era
├── CTC-Based (2014-2019)
│   ├── DeepSpeech 1 & 2: RNN + CTC
│   ├── Jasper: CNN + CTC
│   └── QuartzNet: Depthwise separable CNN + CTC
├── Attention-Based (2016-2020)
│   ├── LAS: LSTM encoder + attention decoder
│   └── Transformer ASR: Self-attention encoder + decoder
└── Hybrid CTC-Attention (2017-present)
    └── ESPnet models, Conformer

Self-Supervised Era
├── wav2vec 2.0 (2020): CNN + Transformer + contrastive
├── HuBERT (2021): CNN + Transformer + BERT-style
├── WavLM (2022): HuBERT + denoising
└── Whisper (2022): Supervised multitask, Enc-Dec Transformer

Streaming / On-device
├── RNN-T: encoder + predictor + joiner
├── Streaming Conformer: chunk-based
└── Distil-Whisper: 6x faster distilled version

5.2 Conformer Architecture (SOTA for ASR)

Input Audio → Log-Mel Spectrogram (80 dims) → Conv Subsampling (4x) → Linear Projection → [Conformer Block × N] → CTC / Attention Head

Conformer Block:
  Input
  ↓ Feed-Forward Module (½ scaling)
  ↓ Multi-Head Self-Attention Module
  ↓ Convolution Module (depthwise)
  ↓ Feed-Forward Module (½ scaling)
  ↓ LayerNorm
  Output

Convolution Module:
  LayerNorm → Pointwise Conv → GLU → Depthwise Conv → BatchNorm → Swish activation → Pointwise Conv → Dropout
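A minimal PyTorch sketch of the block above; `FeedForwardModule`, `MHSAModule` and `ConvModule` are assumed to be implemented as in the diagram (the paper additionally uses relative positional encoding inside the attention module):

import torch.nn as nn

class ConformerBlock(nn.Module):
    """Macaron-style block: the two feed-forward modules use half-step residuals."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model)
        self.mhsa = MHSAModule(d_model, n_heads)
        self.conv = ConvModule(d_model)
        self.ffn2 = FeedForwardModule(d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)   # ½-scaled feed-forward
        x = x + self.mhsa(x)         # multi-head self-attention
        x = x + self.conv(x)         # depthwise convolution module
        x = x + 0.5 * self.ffn2(x)   # second ½-scaled feed-forward
        return self.norm(x)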

5.3 Whisper Architecture Detail

Encoder:
  Log-Mel Spectrogram (80 × 3000 frames for 30 s)
  → 2× Conv1D (stride 1, 2) + GELU
  → Sinusoidal Positional Encoding
  → Transformer Encoder Blocks (4-32 layers depending on model size)
    Each block: Self-Attention + FFN + LayerNorm (pre-norm)

Decoder:
  Special tokens: <|startoftranscript|> <|language|> <|task|> <|notimestamps|>
  → Token Embedding + Learned Positional Encoding
  → Transformer Decoder Blocks (4-32 layers)
    Each block: Masked Self-Attention + Cross-Attention + FFN
  → Linear → Softmax over vocab (51,865 tokens)

5.4 TTS Architecture Family Tree

Classical Era
├── Formant synthesis (rule-based, 1960s-)
├── Concatenative TTS (unit selection, 1990s-)
│   └── Record many hours → select and concatenate units
└── HMM-based TTS (HTS, 2000s-)
    └── STRAIGHT/WORLD vocoder

Neural Era
├── Seq2Seq + Attention
│   ├── Tacotron 1 (2017): CBHG + Griffin-Lim
│   └── Tacotron 2 (2017): BiLSTM + WaveNet vocoder
├── Parallel / Non-autoregressive
│   ├── FastSpeech 1 (2019): FFT + duration from teacher
│   ├── FastSpeech 2 (2020): Duration/pitch/energy predictors
│   ├── SpeedySpeech (2020)
│   └── JETS (2022): E2E with alignment learning
├── Normalizing Flow Based
│   ├── Glow-TTS (2020): Flow-based alignment + generation
│   └── VITS (2021): E2E VAE + flows + HiFi-GAN
└── Diffusion Based
    ├── DiffTTS (2021)
    ├── Grad-TTS (2021): Score-based diffusion
    └── NaturalSpeech (2022): VITS + diffusion

LLM/Codec Era (2023-present)
├── VALL-E (2023): AR + NAR codec language model
├── SPEAR-TTS (2023): Self-supervised TTS
├── Voicebox (2023): Flow matching
├── NaturalSpeech 3 (2024): FACodec + diffusion
├── F5-TTS (2024): Flow matching + flat text input
└── CosyVoice (2024): LLM + flow matching

5.5 VITS Architecture Detail (Recommended Starting Point)

TEXT INPUT
  ↓
[Text Encoder]
  Phoneme embedding → Transformer encoder → Prior distribution μ, σ
[Stochastic Duration Predictor]
  Flow-based duration prediction
[Length Regulator]
  Expand phoneme representations to frame length
[Flow-based Posterior] (training path)
  VAE encoder: mel → latent z; normalizing flows transform z
[HiFi-GAN Generator] (vocoder/decoder)
  z → raw waveform
[Discriminators] (training only)
  Multi-Period Discriminator (MPD)
  Multi-Scale Discriminator (MSD)

LOSS = Mel loss + KL divergence + Duration loss + GAN loss + Feature matching loss

5.6 HiFi-GAN Architecture Detail

Generator:
  Input: Mel Spectrogram (80 × T)
  → Transposed Conv (×4 upsample) → MRF Block → repeat until audio rate
  MRF Block = Multi-Receptive Field Fusion
            = ResBlock(k=3) + ResBlock(k=7) + ResBlock(k=11)
  Each ResBlock: dilated conv with rates [1, 3, 5]
  Output: raw waveform at 22050 Hz

Multi-Period Discriminator (MPD):
  Periods p = [2, 3, 5, 7, 11]
  Reshape waveform into (T/p, p) → Conv2D per period

Multi-Scale Discriminator (MSD):
  Operates at 3 scales: raw, ×2 avg-pooled, ×4 avg-pooled

6. DESIGN & DEVELOPMENT PROCESS

6.1 STT Development from Scratch

Step 1: Data Collection & Preparation

Sources:
  - LibriSpeech: 960h clean English (openslr.org)
  - CommonVoice: Mozilla multilingual crowdsourced
  - VoxPopuli: EU parliament recordings
  - FLEURS: Google multilingual
  - Custom: Record, transcribe, verify

Pipeline:
  raw_audio → segment_by_vad → normalize_loudness → resample_to_16kHz → verify_transcript → create_manifest_json

Manifest format:
  {"audio_filepath": "path/to/audio.wav", "duration": 3.2, "text": "hello world"}
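A small helper that writes such a manifest is sketched below; `transcript_lookup` is an assumed mapping from filename to verified transcript:

import json
from pathlib import Path
import torchaudio

def build_manifest(audio_dir, transcript_lookup, out_path="train_manifest.json"):
    """Write one JSON object per line, matching the manifest format above."""
    with open(out_path, "w") as f:
        for wav in sorted(Path(audio_dir).glob("*.wav")):
            info = torchaudio.info(str(wav))
            entry = {
                "audio_filepath": str(wav),
                "duration": round(info.num_frames / info.sample_rate, 2),
                "text": transcript_lookup[wav.name],
            }
            f.write(json.dumps(entry) + "\n")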

Step 2: Feature Extraction

import torch
import torchaudio
import torchaudio.transforms as T

def extract_mel_spectrogram(waveform, sample_rate=16000):
    mel_transform = T.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,        # ~25 ms window at 16 kHz
        hop_length=160,   # ~10 ms hop
        n_mels=80,
        f_min=80,
        f_max=7600,
    )
    # Log compression; small epsilon avoids log(0)
    log_mel = torch.log(mel_transform(waveform) + 1e-9)
    return log_mel  # shape: (80, T)

Step 3: Model Architecture (Conformer CTC)

import torch.nn as nn

class ConformerASR(nn.Module):
    # Conv2dSubsampling and ConformerBlock are assumed to be implemented separately
    def __init__(self, input_dim=80, vocab_size=29, d_model=256, num_heads=4, num_layers=6):
        super().__init__()
        self.conv_subsample = Conv2dSubsampling(input_dim, d_model)   # 4x time reduction
        self.encoder = nn.ModuleList([
            ConformerBlock(d_model, num_heads) for _ in range(num_layers)
        ])
        self.ctc_head = nn.Linear(d_model, vocab_size)                # per-frame token logits

    def forward(self, x, x_lengths):
        x, x_lengths = self.conv_subsample(x, x_lengths)
        for block in self.encoder:
            x = block(x)
        logits = self.ctc_head(x)
        return logits, x_lengths

Step 4: Training Loop

import torch
import torch.nn.functional as F
from torch.nn import CTCLoss

criterion = CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for batch in dataloader:
    audio, audio_len, tokens, token_len = batch
    logits, out_len = model(audio, audio_len)
    # CTC expects (time, batch, vocab) log-probabilities
    log_probs = F.log_softmax(logits.transpose(0, 1), dim=-1)
    loss = criterion(log_probs, tokens, out_len, token_len)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()

Step 5: Decoding

import torch

# Greedy CTC decoding: collapse repeated tokens, then drop blanks
def greedy_decode(logits, blank_id=0):
    predicted = torch.argmax(logits, dim=-1)
    decoded, prev = [], blank_id
    for p in predicted:
        if p != blank_id and p != prev:
            decoded.append(p.item())
        prev = p
    return decoded

# Beam search with an n-gram language model (pyctcdecode + KenLM)
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(vocab, kenlm_model_path="lm.arpa", alpha=0.5, beta=1.0)
text = decoder.decode(logits.numpy())

Step 6: Evaluation

from jiwer import wer, cer

# Word Error Rate and Character Error Rate
error_rate = wer(reference_texts, hypothesis_texts)
char_error = cer(reference_texts, hypothesis_texts)
print(f"WER: {error_rate:.2%}, CER: {char_error:.2%}")

6.2 TTS Development from Scratch

Step 1: Data Collection & Preparation

Datasets:
  - LJ Speech: 24h single speaker English (ljspeech.github.io)
  - VCTK: 109 English speakers
  - LibriTTS: 585h multi-speaker
  - HiFi-TTS: high-quality multi-speaker
  - Custom: studio-quality recording (quiet room, good mic)

Recording specs for custom data:
  - 44.1 kHz or 48 kHz, 24-bit, mono
  - Acoustic treatment (no echo/reverb)
  - Consistent mic distance (15-20 cm)
  - Phonetically balanced scripts
  - 1-10 hours for fine-tuning; 20+ hours for training from scratch

Preprocessing:
  audio → normalize_to_-20dBFS → trim_silence → resample_22050Hz → extract_mel → create_filelists (train|val|test)
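A minimal preprocessing sketch for the pipeline above (silence trimming, resampling, rough RMS normalization); thresholds are illustrative:

import librosa
import numpy as np
import soundfile as sf

def preprocess_for_tts(in_path, out_path, target_sr=22050, target_dbfs=-20.0):
    y, sr = librosa.load(in_path, sr=None)                 # keep original rate, mono
    y, _ = librosa.effects.trim(y, top_db=35)              # trim leading/trailing silence
    if sr != target_sr:
        y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    rms = np.sqrt(np.mean(y ** 2)) + 1e-9                  # RMS-normalize to roughly -20 dBFS
    y = np.clip(y * (10 ** (target_dbfs / 20) / rms), -1.0, 1.0)
    sf.write(out_path, y, target_sr)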

Step 2: Text Frontend

import phonemizer
from phonemizer.backend import EspeakBackend

backend = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True)

def text_to_phonemes(text):
    # Normalize text first (normalize_numbers / expand_abbreviations are project helpers)
    text = normalize_numbers(text)        # "123" → "one hundred twenty three"
    text = expand_abbreviations(text)     # "Dr." → "Doctor"
    # Convert to phonemes with eSpeak
    phonemes = backend.phonemize([text])[0]
    return phonemes

# Phoneme-to-ID mapping for the model input
phoneme_to_id = {p: i for i, p in enumerate(PHONEME_SET)}

Step 3: FastSpeech 2 Model

import torch.nn as nn

class FastSpeech2(nn.Module):
    # FFTTransformer and VarianceAdaptor are assumed to be implemented separately
    def __init__(self):
        super().__init__()
        self.encoder = FFTTransformer(n_layers=4, d_model=256, n_heads=2)
        self.variance_adaptor = VarianceAdaptor(d_model=256)   # duration / pitch / energy
        self.decoder = FFTTransformer(n_layers=4, d_model=256, n_heads=2)
        self.mel_linear = nn.Linear(256, 80)

    def forward(self, phoneme_ids, duration_target=None, pitch_target=None, energy_target=None):
        x = self.encoder(phoneme_ids)
        x, duration, pitch, energy = self.variance_adaptor(
            x, duration_target, pitch_target, energy_target
        )
        x = self.decoder(x)
        mel = self.mel_linear(x)
        return mel, duration, pitch, energy

Step 4: HiFi-GAN Vocoder Training

import torch.nn.functional as F

# Train HiFi-GAN on mel → waveform pairs
# Generator loss (weights follow the HiFi-GAN paper: 45 for mel, 2 for feature matching)
mel_loss = F.l1_loss(mel_fake, mel_real)
gan_loss = generator_adversarial_loss(disc_fake_outputs)
feature_match = feature_matching_loss(disc_real_features, disc_fake_features)
loss_G = mel_loss * 45 + gan_loss + feature_match * 2

# Discriminator loss
loss_D = discriminator_loss(disc_real_outputs, disc_fake_outputs)

Step 5: End-to-End Inference Pipeline

import torch

class TTSPipeline:
    def __init__(self, tts_model, vocoder):
        self.tts = tts_model
        self.vocoder = vocoder

    def synthesize(self, text, speed=1.0):
        # 1. Text → phonemes → IDs
        phonemes = text_to_phonemes(text)
        phoneme_ids = text_to_ids(phonemes)

        # 2. Acoustic model → mel spectrogram (d_control scales predicted durations)
        with torch.no_grad():
            mel, *_ = self.tts(
                torch.LongTensor(phoneme_ids).unsqueeze(0),
                d_control=speed
            )

        # 3. Vocoder → waveform
        with torch.no_grad():
            audio = self.vocoder(mel)
        return audio.squeeze().cpu().numpy()

7. REVERSE ENGINEERING EXISTING SYSTEMS

7.1 Why Reverse Engineering?

  • Learn from production-grade code
  • Understand design decisions
  • Identify optimizations for your use case
  • Build intuition faster than pure theory

7.2 How to Reverse Engineer Whisper

Step 1: Read the Paper

  • "Robust Speech Recognition via Large-Scale Weak Supervision" (Radford et al. 2022)
  • Note: architecture section, training details, data section

Step 2: Clone and Explore Code

git clone https://github.com/openai/whisper
# Key files:
#   whisper/model.py     - Architecture
#   whisper/audio.py     - Feature extraction
#   whisper/decoding.py  - Beam search decoder
#   whisper/tokenizer.py - BPE tokenizer

Step 3: Trace Forward Pass

import torch
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("tiny")
tokenizer = get_tokenizer(model.is_multilingual)

# Trace: audio → features (pad/trim to 30 s before computing the log-mel)
audio = whisper.load_audio("speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio)     # (80, 3000)

# Encoder
encoded = model.encoder(mel.unsqueeze(0))    # (1, 1500, 384) for tiny

# Decoder (autoregressive, greedy)
tokens = [tokenizer.sot]                     # start-of-transcript token
for _ in range(100):
    logits = model.decoder(torch.tensor([tokens]), encoded)
    next_token = logits[0, -1].argmax().item()
    if next_token == tokenizer.eot:
        break
    tokens.append(next_token)
print(tokenizer.decode(tokens[1:]))

Step 4: Profile Bottlenecks

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    result = model.transcribe("audio.wav")

print(prof.key_averages().table(sort_by="cuda_time_total"))
# Identify where time goes: the encoder for short clips, the autoregressive decoder for long transcripts
# Optimization ideas: cache encoder output, batch decoding, use faster-whisper/CTranslate2

Step 5: Rebuild from Scratch (Your Understanding)

# After studying, rebuild each component from scratch:
class MultiHeadAttention(nn.Module):
    # Implement from scratch based on your understanding
    ...

class ResidualAttentionBlock(nn.Module):
    # Implement the encoder block
    ...

# Compare outputs against the original model
torch.testing.assert_close(your_output, original_output, atol=1e-4, rtol=1e-4)

7.3 How to Reverse Engineer VITS

git clone https://github.com/jaywalnut310/vits
# Key files:
#   models.py          - SynthesizerTrn (full model)
#   attentions.py      - Transformer blocks
#   modules.py         - WN (WaveNet-style), ResidualCouplingBlock (flows)
#   monotonic_align/   - MAS (Monotonic Alignment Search)
#   mel_processing.py  - Mel spectrogram computation

Key insight from VITS code:

  • SynthesizerTrn.infer() is inference path (no VAE encoder needed)
  • SynthesizerTrn.forward() is training path (requires mel as target)
  • monotonic_align.maximum_path() is the alignment algorithm (Cython)

7.4 Reverse Engineering Approach Template

  1. READ paper abstract + architecture section β†’ mental model
  2. CLONE repository β†’ understand file structure
  3. TRACE data flow (print shapes at each step)
  4. IDENTIFY key components β†’ isolate each into test
  5. REPRODUCE in clean code from memory
  6. VERIFY outputs match original
  7. EXPERIMENT: change hyperparameters, observe effects
  8. OPTIMIZE: profile, identify bottlenecks, improve

8. HARDWARE REQUIREMENTS

8.1 Development Hardware

Minimum (Learning & Experimentation)
  CPU: Intel Core i7 / AMD Ryzen 7 (8+ cores)
  RAM: 16GB (32GB preferred)
  GPU: NVIDIA RTX 3060 (12GB VRAM) or RTX 3070
  Storage: 500GB SSD (NVMe preferred)
  Note: Can fine-tune small models, run inference on all models

Recommended (Training Medium Models)
  CPU: Intel Core i9 / AMD Ryzen 9 / Threadripper
  RAM: 64GB DDR4/DDR5
  GPU: NVIDIA RTX 3090 (24GB) or RTX 4090 (24GB), single GPU
  Storage: 2TB NVMe SSD + 4TB HDD for datasets
  Cost: ~$3,000-$5,000
  Note: Train FastSpeech 2, HiFi-GAN, Conformer from scratch on LJ Speech

Research-Grade (Training Large Models)
  GPU: 4× RTX 4090 or 4× A100 (40GB or 80GB)
  RAM: 256GB
  Storage: 10TB+ NVMe
  Network: 100GbE for distributed training
  Cost: $15,000-$40,000
  Note: Train VITS, Conformer on LibriSpeech 960h

Cloud (Production Training)
  AWS:
    p3.2xlarge   - 1× V100 16GB ($3.06/hr)
    p3.8xlarge   - 4× V100 64GB ($12.24/hr)
    p4d.24xlarge - 8× A100 40GB ($32.77/hr)
  Google Cloud:
    a2-highgpu-1g - 1× A100 40GB ($3.67/hr)
    a2-highgpu-8g - 8× A100 40GB ($29.39/hr)
  Lambda Labs (cheapest GPU cloud):
    1× A100 80GB ~$1.29/hr
    8× A100 80GB ~$10.32/hr
  Use spot/preemptible instances for a ~60-70% discount

8.2 VRAM Requirements by Model

Model | Task | VRAM (Training) | VRAM (Inference)
Conformer-S (10M) | ASR | 8GB | <1GB
Conformer-M (30M) | ASR | 12GB | 2GB
Whisper Large v3 | ASR | - (pretrained) | 6GB
FastSpeech 2 | TTS | 8GB | <1GB
VITS | TTS | 12GB | 2GB
HiFi-GAN | Vocoder | 8GB | <1GB
VALL-E style | TTS | 40GB+ | 8GB+
Whisper large fine-tune | ASR | 24GB | 6GB

8.3 Production Inference Hardware

CPU-Only (Lightweight)
  Use case: low-traffic, edge, embedded
  Hardware: modern x86 CPU with AVX2
  Tools: ONNX Runtime, OpenVINO, CTranslate2 (CPU)
  Models: Whisper tiny/base, Kokoro TTS
  Latency: roughly 1-5× real-time (RTF > 1, i.e. slower than real time for larger models)

GPU Server (Production)
  Use case: high-traffic API service
  Hardware: NVIDIA T4 ($0.35/hr on AWS), A10G, RTX 4090
  Tools: Triton Server, TensorRT, CTranslate2
  Models: Whisper large, VITS, XTTS
  Latency: RTF 0.1-0.3 (several times faster than real time)

Edge Devices
  NVIDIA Jetson Orin: on-device AI, 16-64GB unified memory
  Apple Silicon M2/M3: Metal GPU, excellent for CoreML models
  Raspberry Pi 5: light STT only (Vosk, Whisper tiny)
  Android/iOS: TFLite, ONNX Mobile models

9. BUILDING YOUR OWN SERVICE

9.1 System Architecture Overview

CLIENT LAYER: Web App | Mobile | API Consumer
        │  (HTTPS / WebSocket)
        ▼
API GATEWAY: Rate Limiting | Auth | Load Balancing
        │
   ┌────┴────────────┐
   ▼                 ▼
STT Service          TTS Service
(FastAPI + Whisper)  (FastAPI + VITS/XTTS)
   │                 │
   └────────┬────────┘
            ▼
INFERENCE LAYER: GPU Workers (Triton / CTranslate2 / ONNX)
            │
   ┌────────┴────────┐
   ▼                 ▼
Message Queue        Model Registry
(Redis/Kafka)        (MLflow / S3)

STORAGE LAYER: Audio Storage (S3/GCS) | DB (PostgreSQL)

9.2 STT Service Implementation

FastAPI STT Service

import io
import numpy as np
from fastapi import FastAPI, UploadFile, File, WebSocket
from fastapi.responses import JSONResponse
from faster_whisper import WhisperModel

app = FastAPI(title="STT Service")

# Load model once at startup
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str = "en",
    task: str = "transcribe"   # or "translate"
):
    # Read uploaded audio into memory
    audio_bytes = await file.read()
    audio_buffer = io.BytesIO(audio_bytes)

    # Transcribe
    segments, info = model.transcribe(
        audio_buffer,
        language=language,
        task=task,
        beam_size=5,
        word_timestamps=True
    )

    result = {
        "language": info.language,
        "language_probability": info.language_probability,
        "duration": info.duration,
        "segments": [
            {
                "start": s.start,
                "end": s.end,
                "text": s.text,
                "words": [{"word": w.word, "start": w.start, "end": w.end}
                          for w in (s.words or [])]
            }
            for s in segments
        ]
    }
    return JSONResponse(result)

@app.websocket("/stream")
async def stream_transcribe(websocket: WebSocket):
    await websocket.accept()
    # Simple chunked streaming: transcribe roughly every 2 seconds of audio
    buffer = b""
    async for data in websocket.iter_bytes():
        buffer += data
        if len(buffer) >= 32000 * 2:   # 2 seconds of 16 kHz int16 PCM
            audio = np.frombuffer(buffer, dtype=np.int16).astype(np.float32) / 32768.0
            segments, _ = model.transcribe(audio, language="en")
            for seg in segments:
                await websocket.send_json({"text": seg.text, "final": False})
            buffer = b""
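A matching client call against the batch endpoint above (URL and file path are placeholders):

import requests

# Batch transcription via the REST endpoint defined above
with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("meeting.wav", f, "audio/wav")},
        params={"language": "en", "task": "transcribe"},
    )
resp.raise_for_status()
for seg in resp.json()["segments"]:
    print(f"[{seg['start']:.1f}s] {seg['text']}")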

Dockerized STT Service

FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip ffmpeg
WORKDIR /app
COPY requirements.txt .
RUN pip install faster-whisper fastapi uvicorn python-multipart
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

9.3 TTS Service Implementation

import io
import numpy as np
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from TTS.api import TTS

app = FastAPI(title="TTS Service")

# Load model once at startup (single-speaker LJ Speech VITS)
tts = TTS("tts_models/en/ljspeech/vits", gpu=True)

@app.post("/synthesize")
async def synthesize(
    text: str,
    speaker_id: int = 0,     # only used by multi-speaker models
    speed: float = 1.0,
    format: str = "wav"
):
    # Generate audio
    wav = tts.tts(text=text, speaker=speaker_id, speed=speed)

    # Convert to bytes
    buffer = io.BytesIO()
    sf.write(buffer, np.array(wav), 22050, format=format.upper())
    buffer.seek(0)

    return StreamingResponse(
        buffer,
        media_type=f"audio/{format}",
        headers={"Content-Disposition": f"attachment; filename=speech.{format}"}
    )

@app.post("/synthesize/stream")
async def synthesize_stream(text: str):
    """Stream audio chunks as they're generated
    (split_into_sentences / wav_to_bytes are project helpers)."""
    async def generate():
        for sentence in split_into_sentences(text):
            wav = tts.tts(text=sentence)
            yield wav_to_bytes(wav)

    return StreamingResponse(generate(), media_type="audio/wav")

9.4 Voice Cloning Service

# Using XTTS for zero-shot voice cloning
import io
import uuid
import soundfile as sf
from fastapi import UploadFile, File
from fastapi.responses import StreamingResponse
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/", eval=True)
model.cuda()

@app.post("/clone")
async def clone_voice(
    text: str,
    reference_audio: UploadFile = File(...),
    language: str = "en"
):
    # Save reference audio temporarily
    ref_bytes = await reference_audio.read()
    ref_path = f"/tmp/{uuid.uuid4()}.wav"
    with open(ref_path, "wb") as f:
        f.write(ref_bytes)

    # Compute speaker latents from the reference clip
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[ref_path]
    )

    # Synthesize in the cloned voice
    out = model.inference(
        text=text,
        language=language,
        gpt_cond_latent=gpt_cond_latent,
        speaker_embedding=speaker_embedding,
        temperature=0.7
    )

    buffer = io.BytesIO()
    sf.write(buffer, out["wav"], 24000, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")

9.5 Deployment with Docker Compose

version: '3.8'
services:
  stt-service:
    build: ./stt
    ports: ["8001:8000"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/models
    environment:
      - MODEL_PATH=/models/whisper-large-v3
  tts-service:
    build: ./tts
    ports: ["8002:8000"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/models
  nginx:
    image: nginx:alpine
    ports: ["80:80", "443:443"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/ssl/certs
    depends_on: [stt-service, tts-service]
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

10. BUILD IDEAS: BEGINNER β†’ ADVANCED

🟢 BEGINNER LEVEL (0-3 months)

Project 1: Voice Recorder + Transcriber

BEGINNER
Stack: Python + OpenAI Whisper + sounddevice
  - Record from microphone with sounddevice
  - Save as WAV file
  - Pass to Whisper for transcription
  - Save transcript as text file
Learning: Audio I/O, Whisper API, file handling

Project 2: Text-to-Speech Converter

BEGINNER
Stack: Python + Coqui TTS or pyttsx3
  - Accept text input via CLI
  - Generate speech audio
  - Play back or save to file
  - Support multiple voices
Learning: TTS API, audio output, voice selection

Project 3: Meeting Transcriber

BEGINNER
Stack: Python + Whisper + pyaudio + tkinter
  - Simple GUI with start/stop recording
  - Real-time transcription display
  - Export to .txt or .docx
  - Speaker diarization (basic)
Learning: GUI, audio streaming, file export

Project 4: Language Learning Pronunciation Checker

BEGINNER
Stack: Python + Whisper + phonemizer
  - User reads a sentence aloud
  - Compare recognized phonemes to expected phonemes
  - Score pronunciation accuracy
  - Highlight mispronounced words
Learning: Phoneme comparison, scoring, feedback

🟡 INTERMEDIATE LEVEL (3-8 months)

Project 5: Real-Time Transcription Web App

INTERMEDIATE
Stack: FastAPI + WebSocket + Whisper + React
  - Browser captures microphone stream (MediaRecorder API)
  - Streams audio chunks to WebSocket server
  - Server transcribes chunks with streaming Whisper
  - Live captions displayed in browser
  - Export transcript feature
Learning: WebSockets, browser audio API, streaming inference

Project 6: Podcast TTS Generator

INTERMEDIATE
Stack: FastSpeech2 + HiFi-GAN + FastAPI + React
  - Input: article URL or text
  - Extract text from URL (newspaper3k)
  - Normalize text (numbers, abbreviations)
  - Generate speech with FastSpeech2 + HiFi-GAN
  - Return downloadable MP3
  - Add playback controls in UI
Learning: TTS pipeline, text normalization, web scraping, audio encoding

Project 7: Voice Command System

INTERMEDIATE
Stack: Whisper + intent classification + TTS response
  - Wake word detection (Picovoice Porcupine or custom)
  - STT for command capture
  - Intent extraction (fine-tuned BERT or regex)
  - Execute command (volume, calendar, search, etc.)
  - TTS response
Learning: Wake word detection, intent classification, action execution

Project 8: Multi-Speaker Diarization + Transcription

INTERMEDIATE
Stack: Whisper + pyannote.audio + spaCy
  - Transcribe audio with word timestamps
  - Run speaker diarization (pyannote)
  - Assign speakers to transcribed words
  - Output: "Speaker 1: Hello... Speaker 2: Hi..."
  - Format as readable transcript
Learning: Diarization, timestamp alignment, NLP post-processing

Project 9: Fine-Tuned Domain ASR

INTERMEDIATE
Stack: Whisper + HuggingFace + custom medical/legal corpus
  - Collect domain-specific audio + transcripts
  - Fine-tune Whisper small or medium
  - Evaluate domain-specific WER improvement
  - Deploy via FastAPI
Learning: Transfer learning, dataset preparation, evaluation, deployment

Project 10: Custom Voice Cloner

INTERMEDIATE
Stack: XTTS-v2 or YourTTS + FastAPI
  - API endpoint: /clone with {text, reference_audio}
  - Accept 5-15 s reference audio
  - Generate speech in cloned voice
  - Multiple language support
Learning: Voice cloning, speaker embeddings, API design

🔴 ADVANCED LEVEL (8-18 months)

Project 11: Production STT Service (Commercial Grade)

ADVANCED
Features:
  - Multi-language detection and transcription
  - Real-time streaming (WebSocket) + batch API (REST)
  - Speaker diarization
  - Custom vocabulary / hotword boosting
  - Punctuation and capitalization restoration
  - Confidence scores per word
  - Webhook callbacks for async jobs
  - Dashboard: usage, latency, error rates
Stack: Whisper large + NeMo + pyannote + FastAPI + Redis + PostgreSQL + Prometheus + Grafana + Kubernetes + Nginx
Scaling: Horizontal pod autoscaling based on GPU queue depth

Project 12: Production TTS Service (API like ElevenLabs)

ADVANCED
Features:
  - 20+ pre-built voices with distinct personalities
  - Zero-shot voice cloning from <30 s reference
  - Emotion/style control (happy, sad, excited, whisper)
  - SSML support (rate, pitch, emphasis, break)
  - Streaming audio generation
  - 20+ language support
  - REST API + Python/JS SDK
  - Usage billing integration
Stack: VITS + XTTS + HiFi-GAN + FastAPI + Stripe + Redis + S3 + CDN

Project 13: Voice Conversion System

ADVANCED
Features:
  - Convert speaker identity while preserving content
  - Any-to-any voice conversion
  - Real-time capability (<200 ms latency)
Architecture:
  Input audio → ASR (content) + Speaker encoder (style) → Voice decoder → Target voice audio
Models: FreeVC, DDSP-VC, Diff-VC, QuickVC
Learning: Disentanglement, speaker representation, real-time processing

Project 14: End-to-End Speech Translation

ADVANCED
Architecture:
  Source language audio → SeamlessM4T / NLLB-Audio → Target language text (or audio)
Features:
  - Direct speech-to-speech translation (no text intermediate)
  - 100+ language pairs
  - Real-time streaming
  - Preserve prosody/emotion in output
Stack: SeamlessM4T (Meta) + FastAPI + WebSocket

Project 15: Train Your Own TTS from Scratch

ADVANCED
Steps:
  1. Record 20+ hours of custom voice in a studio
  2. Segment and transcribe all audio (Whisper-assisted)
  3. Train a FastSpeech 2 acoustic model from scratch
  4. Train a HiFi-GAN vocoder from scratch
  5. Fine-tune VITS end-to-end
  6. Implement MOS (Mean Opinion Score) evaluation
  7. A/B test against Coqui/ElevenLabs
Learning: Full training pipeline, data curation, model evaluation, production deployment

🔵 RESEARCH / EXPERT LEVEL (18+ months)

Project 16: Codec Language Model TTS (VALL-E style)

RESEARCH
Architecture:
  Text → Phonemes → Token sequence → AR Transformer → Coarse codec tokens → NAR Transformer → Fine codec tokens → EnCodec decoder → Audio
Training:
  - Pretrain on 10,000+ hours of diverse speech
  - EnCodec tokenizer (8 codebooks, 75 Hz)
  - GPT-style LM for coarse tokens
  - BERT-style masked model for fine tokens
Innovation opportunities:
  - Better alignment between text and audio tokens
  - Emotion conditioning
  - Efficiency improvements

Project 17: Streaming On-Device STT (Mobile)

RESEARCH
Target: <50 ms latency, <100 MB model, runs on a phone CPU
Approach:
  - Start with Conformer-Tiny + CTC
  - Quantize to INT8
  - Optimize with a TFLite delegate or CoreML
  - Implement streaming chunk processing
  - Add on-device LM rescoring (tiny n-gram)
Platforms: Android (TFLite) + iOS (CoreML)
Learning: Mobile ML optimization, quantization, edge deployment

Project 18: Multilingual Universal Speech Model

RESEARCH
Scope: Single model for 50+ languages, STT + TTS
STT:
  - Pretrain wav2vec 2.0 on a 50-language corpus
  - Fine-tune with multilingual CTC
  - Adapter modules per language
TTS:
  - Shared phoneme inventory across languages
  - Language embedding conditioning
  - Cross-lingual transfer for low-resource languages
Evaluation: FLEURS benchmark across all languages

11. CUTTING-EDGE DEVELOPMENTS (2023–2025)

11.1 Speech Recognition

  • Whisper Large v3 Turbo (2024): pruned 4-layer decoder, 809M params, much faster than large-v3 with similar accuracy
  • Distil-Whisper (2023, Hugging Face): 6x speedup, 49% fewer params, <1% WER degradation
  • Universal-1 (AssemblyAI, 2024): SOTA commercial STT, best on noisy data
  • Gemini Audio (Google, 2024): Natively multimodal, audio reasoning
  • Canary-1B (NVIDIA, 2024): Conformer + attention, multilingual, speech translation
  • MMS (Meta, 2023): 1000+ language STT using one model
  • OWSM (2024): Open Whisper-style Speech Models, an open-source, open-data reproduction of Whisper (tens of thousands of hours of public speech)
  • parakeet-tdt (NVIDIA, 2024): Token-and-duration transducer, near real-time

11.2 Text-to-Speech

  • VALL-E 2 (Microsoft, 2024): first TTS to report human parity on LibriSpeech/VCTK, using repetition-aware sampling and grouped code modeling
  • NaturalSpeech 3 (Microsoft, 2024): FACodec disentanglement + diffusion
  • CosyVoice (Alibaba, 2024): LLM-based TTS with flow matching
  • F5-TTS (2024): Flow matching TTS, DiT architecture, flat text input, SOTA
  • E2-TTS (2024): Simple flow-matching TTS, impressive quality
  • FireRedTTS (2024): High-quality Chinese TTS
  • Kokoro (2024): Small (82M), fast, open-weights, near-SOTA quality
  • StyleTTS 2 (2023): Diffusion + style modeling, SOTA on LJ Speech
  • Parler-TTS (2024): Natural language description controls TTS voice
  • Amphion (2024): Unified open-source TTS/VC/SVC framework
  • HierSpeech++ (2024): Hierarchical variational inference, high quality

11.3 Voice & Audio Foundation Models

  • EnCodec (Meta, 2022): Neural audio codec, 24kHz, residual VQ
  • DAC (Descript, 2023): Improved neural codec, better perceptual quality
  • AudioPaLM (Google, 2023): Multimodal LLM combining speech + text
  • SpeechX (Microsoft, 2023): Unified speech model for many tasks
  • UniAudio (2023): One model for 11 audio tasks
  • VoxtLM (2024): Language model for joint speech-text
  • Spirit LM (Meta, 2024): Interleaved speech-text LLM with expressive speech

11.4 Voice Cloning & Conversion

  • OpenVoice v2 (2024): Near-zero-shot cloning, tone/style/accent control
  • XTTS v2 (Coqui, 2023): 17-language voice cloning, 6s reference
  • RVC v2: Retrieval-based Voice Conversion, widely used for real-time and singing voice conversion
  • So-VITS-SVC: Singing voice conversion based on VITS
  • Seed-TTS (ByteDance, 2024): Near-perfect voice cloning, emotional control

11.5 Real-Time & Streaming

  • Moshi (Kyutai, 2024): Real-time full-duplex speech dialogue system
  • RealtimeTTS: Python library for ultra-low-latency streaming TTS
  • moonshine (Useful Sensors, 2024): On-device STT, faster than Whisper tiny
  • whisper.cpp: C++ Whisper, runs on CPU, iOS, Android, Raspberry Pi

11.6 Key Research Directions (2025+)

  • Speech LLMs: End-to-end spoken dialogue models (like GPT-4o audio)
  • Zero-shot multilingual TTS: One model, any language, any voice
  • Codec-based unified models: Everything tokenized as audio codes
  • On-device streaming: Sub-100ms full-stack STT+TTS on mobile
  • Emotional speech: Expressive control beyond speed/pitch
  • Personalization: Continuous adaptation from user speech
  • Anti-spoofing: Detecting deepfake audio (ADD challenge)

12. RESOURCES, DATASETS & REFERENCES

12.1 Key Research Papers (Read in Order)

STT Papers
  1. "A tutorial on hidden Markov models" (Rabiner, 1989)
  2. "Deep Speech: Scaling up end-to-end speech recognition" (Baidu, 2014)
  3. "Connectionist Temporal Classification" (Graves et al., 2006)
  4. "Attention-Based Models for Speech Recognition" (Chorowski, 2015)
  5. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech" (Meta, 2020)
  6. "HuBERT: Self-Supervised Speech Representation Learning" (Meta, 2021)
  7. "Conformer: Convolution-augmented Transformer for SR" (Google, 2020)
  8. "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper, OpenAI, 2022)
  9. "Distil-Whisper: Robust Knowledge Distillation" (Hugging Face, 2023)
TTS Papers
  1. "WaveNet: A Generative Model for Raw Audio" (DeepMind, 2016)
  2. "Tacotron: Towards End-to-End Speech Synthesis" (Google, 2017)
  3. "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2, 2018)
  4. "FastSpeech: Fast, Robust and Controllable TTS" (Microsoft, 2019)
  5. "FastSpeech 2: Fast and High-Quality E2E TTS" (Microsoft, 2020)
  6. "HiFi-GAN: Generative Adversarial Networks for Audio Synthesis" (2020)
  7. "VITS: Conditional Variational Autoencoder with Adversarial Learning for E2E TTS" (2021)
  8. "VALL-E: Neural Codec Language Models are Zero-Shot TTS" (Microsoft, 2023)
  9. "Voicebox: Text-Guided Multilingual Universal Speech Generation" (Meta, 2023)
  10. "NaturalSpeech 3: Zero-Shot Copier-Free Voice Cloning" (Microsoft, 2024)
  11. "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech" (2024)

12.2 Datasets

ASR Datasets
  • LibriSpeech - 960h English, clean + noisy (openslr.org/12)
  • CommonVoice 17 - Mozilla, 100+ languages (commonvoice.mozilla.org)
  • VoxPopuli - 1791h EU parliament (github.com/facebookresearch/voxpopuli)
  • GigaSpeech - 10,000h English, diverse (github.com/SpeechColab/GigaSpeech)
  • AISHELL-1/2 - Mandarin Chinese (openslr.org/33)
  • MCV Corpus - Hindi, Marathi, Tamil, Telugu + many Indian languages
  • MUSAN - noise, music, speech for augmentation
  • RIR_NOISES - room impulse responses
  • FLEURS - Google, 100 languages, ~12h each
TTS Datasets
  • LJ Speech - 24h single speaker, high quality (keithito.com/LJ-Speech-Dataset)
  • VCTK - 109 speakers, English (datashare.ed.ac.uk)
  • LibriTTS - 585h multi-speaker, clean (openslr.org/60)
  • HiFi-TTS - 291h high-quality multi-speaker
  • AISHELL-3 - 85h Mandarin multi-speaker
  • CSS10 - 10 languages, single speaker each
  • Kokoro dataset - high-quality curated English
  • ESD - Emotional Speech Dataset (5 emotions, 10 speakers)

12.3 Pre-trained Models to Start With

STT
  • openai/whisper-large-v3 (HuggingFace)
  • nvidia/parakeet-tdt-1.1b (HuggingFace)
  • facebook/wav2vec2-large-960h-lv60 (HuggingFace)
  • speechbrain/asr-conformer-... (SpeechBrain Hub)
TTS
  • tts_models/en/ljspeech/vits (Coqui TTS)
  • tts_models/multilingual/multi-dataset/xtts_v2 (Coqui)
  • hexgrad/Kokoro-82M (HuggingFace)
  • facebook/mms-tts-eng (HuggingFace)
  • suno/bark (HuggingFace)
Speaker
  • speechbrain/spkrec-ecapa-voxceleb (Speaker verification)
  • pyannote/speaker-diarization-3.1 (Diarization)

12.4 Courses & Learning Resources

COURSES
  • Stanford CS224S: Spoken Language Processing (free online)
  • Fast.ai Practical Deep Learning (free)
  • DeepLearning.AI Sequence Models (Coursera)
  • CMU 11-751 Speech Recognition (lecture slides free)
  • Hugging Face Audio Course (huggingface.co/learn/audio-course - free)
BOOKS
  • "Speech and Language Processing" β€” Jurafsky & Martin (free PDF: web.stanford.edu/~jurafsky/slp3)
  • "Fundamentals of Speech Recognition" β€” Rabiner & Juang
  • "Deep Learning" β€” Goodfellow, Bengio, Courville (free: deeplearningbook.org)
  • "Neural Network Methods for NLP" β€” Goldberg
KEY BLOGS & RESOURCES
  • Lilian Weng's Blog (lilianweng.github.io) - excellent deep dives
  • Papers With Code (paperswithcode.com/task/speech-synthesis)
  • Hugging Face Blog
  • NVIDIA Developer Blog
  • Distill.pub - visual explanations
COMMUNITIES
  • r/MachineLearning (Reddit)
  • Hugging Face Discord
  • ESPnet GitHub Discussions
  • SpeechBrain Slack
  • ML Discord servers

12.5 Benchmarks & Evaluation Tools

STT Benchmarks
  • LibriSpeech test-clean: Target WER < 2.5% (SOTA ~1.4%)
  • LibriSpeech test-other: Target WER < 5% (SOTA ~2.7%)
  • CommonVoice: Multilingual WER
  • Earnings21: Real-world earnings call transcription
  • CHiME-6: Noisy far-field challenge
  • NOIZEUS: Noise robustness
TTS Evaluation
  • MOS (Mean Opinion Score): Human evaluation 1-5 scale
  • UTMOS: Automatic MOS predictor (neural)
  • DNSMOS P.835: Noise/speech quality
  • SpeechBERTScore: Semantic similarity
  • PESQ: Perceptual speech quality
  • STOI: Short-Time Objective Intelligibility
  • F0 RMSE: Pitch prediction accuracy
  • MCD (Mel Cepstral Distortion): Acoustic similarity
Tools for Evaluation
# WER / CER calculation
pip install jiwer
from jiwer import wer, cer

# MOS prediction
pip install speechmos
from speechmos import dnsmos

# PESQ and STOI
pip install pesq pystoi

# Forced alignment (for TTS duration evaluation)
pip install montreal-forced-aligner

12.6 QUICK START CHECKLIST

Week 1: Environment Setup
  • Install Python 3.10+, CUDA, PyTorch with GPU support
  • Install librosa, torchaudio, transformers, TTS (Coqui)
  • Download and run Whisper on a test file
  • Run Coqui TTS on a test sentence
  • Plot a mel spectrogram from scratch
Week 2–4: Foundations
  • Implement MFCC from scratch (no librosa)
  • Implement simple HMM for digit recognition
  • Fine-tune Whisper base on a custom 1-hour dataset
  • Run VITS inference on LJ Speech
Month 2–3: First Models
  • Train FastSpeech 2 on LJ Speech (2-3 days on an RTX 3090)
  • Train HiFi-GAN vocoder
  • Build a REST API for STT and TTS
  • Build a simple web UI for your service
Month 4–6: Production Service
  • Containerize with Docker
  • Add authentication, rate limiting
  • Add monitoring (Prometheus + Grafana)
  • Deploy to cloud (AWS/GCP/Lambda Labs)
  • Achieve <300ms TTS latency
Month 7–12: Specialization
  • Choose: voice cloning, multilingual, on-device, or real-time streaming
  • Train a model from scratch on custom data
  • Publish an open-source project or demo
  • Read 5+ papers from the cutting-edge list